Audio-attention discriminative language model for ASR rescoring
End-to-end approaches for automatic speech recognition (ASR) benefit from
directly modeling the probability of the word sequence given the input audio
stream in a single neural network. However, compared to conventional ASR
systems, these models typically require more data to achieve comparable
results. Well-known model adaptation techniques for domain and style
adaptation are not easily applicable to end-to-end systems. Conventional
HMM-based systems, on the other hand, have been optimized for various
production environments and use cases. In this work, we propose to combine the
benefits of end-to-end approaches with a conventional system using an
attention-based discriminative language model that learns to rescore the output
of a first-pass ASR system. We show that learning to rescore a list of
potential ASR outputs is much simpler than learning to generate the hypothesis.
The proposed model yields an 8% improvement in word error rate even when the
amount of training data is a fraction of the data used to train the first-pass
system.
Comment: 4 pages, 1 figure, Accepted at ICASSP 202
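To make the second-pass idea concrete, here is a minimal sketch, not taken from the paper, of rescoring an n-best list by interpolating first-pass ASR scores with a second-pass scorer; dlm_score stands in for the audio-attention discriminative LM and the interpolation weight is an assumed knob.

def rescore_nbest(nbest, dlm_score, weight=0.5):
    """Return the best hypothesis under an interpolated first/second-pass score.

    nbest     : list of (hypothesis_text, first_pass_score) pairs, where higher
                scores are better (e.g. log-probabilities).
    dlm_score : callable mapping hypothesis_text -> second-pass score; here a
                stand-in for the audio-attention discriminative LM.
    weight    : interpolation weight for the second pass (an assumption).
    """
    best_hyp, best_score = None, float("-inf")
    for hyp, first_pass in nbest:
        total = (1.0 - weight) * first_pass + weight * dlm_score(hyp)
        if total > best_score:
            best_hyp, best_score = hyp, total
    return best_hyp, best_score

# Toy usage with a dummy second-pass scorer that mildly prefers shorter output.
nbest = [("play the beetles", -3.2), ("play the beatles", -3.5)]
print(rescore_nbest(nbest, dlm_score=lambda hyp: -0.1 * len(hyp.split())))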
Streaming Speech-to-Confusion Network Speech Recognition
In interactive automatic speech recognition (ASR) systems, low-latency
requirements limit the amount of search space that can be explored during
decoding, particularly in end-to-end neural ASR. In this paper, we present a
novel streaming ASR architecture that outputs a confusion network while
maintaining limited latency, as needed for interactive applications. We show
that 1-best results of our model are on par with a comparable RNN-T system,
while the richer hypothesis set allows second-pass rescoring to achieve 10-20%
lower word error rate on the LibriSpeech task. We also show that our model
outperforms a strong RNN-T baseline on a far-field voice assistant task.
Comment: Submitted to Interspeech 202
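As an illustration of the output representation, here is a minimal sketch, assuming a confusion network is stored as a list of word-posterior slots, of reading out the 1-best hypothesis; the data structure and the "<eps>" skip convention are generic, not the paper's implementation.

def one_best(confusion_network):
    """Pick the highest-posterior word in every slot, dropping <eps> (skip) arcs."""
    words = []
    for slot in confusion_network:           # slot: dict mapping word -> posterior
        word = max(slot, key=slot.get)
        if word != "<eps>":
            words.append(word)
    return " ".join(words)

# Toy confusion network for a short voice-assistant style utterance.
cn = [
    {"turn": 0.9, "turned": 0.1},
    {"on": 0.6, "<eps>": 0.4},
    {"the": 0.7, "a": 0.3},
    {"lights": 0.8, "light": 0.2},
]
print(one_best(cn))  # -> "turn on the lights"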
Contextual Language Model Adaptation for Conversational Agents
Statistical language models (LM) play a key role in Automatic Speech
Recognition (ASR) systems used by conversational agents. These ASR systems
should provide high accuracy across a variety of speaking styles, domains,
vocabularies, and argots. In this paper, we present a DNN-based method to adapt
the LM to each user-agent interaction based on generalized contextual
information, by predicting an optimal, context-dependent set of LM
interpolation weights. We show that this framework for contextual adaptation
provides accuracy improvements under different possible mixture LM partitions
that are relevant both for (1) goal-oriented conversational agents, where it is
natural to partition the data by the requested application, and (2)
non-goal-oriented conversational agents, where the data can be partitioned
using topic labels predicted by a topic classifier. We obtain a relative
WER improvement of 3% with a 1-pass decoding strategy and 6% in a 2-pass
decoding framework, over an unadapted model. We also show up to a 15% relative
improvement in recognizing named entities, which is of significant value for
conversational ASR systems.
Comment: Interspeech 2018 (accepted)
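The interpolation-weight idea can be sketched as follows. This is a simplified illustration, not the paper's system: the DNN weight predictor is replaced by arbitrary weight_logits, and the component LMs are assumed to expose per-word probabilities.

import math

def softmax(logits):
    m = max(logits)
    exps = [math.exp(x - m) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]

def mixture_logprob(word, history, component_lms, weight_logits):
    """Context-dependent linear interpolation of component LM probabilities.

    component_lms : list of callables (word, history) -> probability.
    weight_logits : unnormalized weights; in the adaptive setting these would
                    come from a DNN reading generalized context features
                    (requested application, topic label, ...).
    """
    weights = softmax(weight_logits)
    p = sum(w * lm(word, history) for w, lm in zip(weights, component_lms))
    return math.log(p)

# Toy usage with two dummy unigram "LMs" and hand-picked context weights.
lm_music = lambda w, h: {"play": 0.3}.get(w, 0.01)
lm_weather = lambda w, h: {"rain": 0.2}.get(w, 0.01)
print(mixture_logprob("play", [], [lm_music, lm_weather], weight_logits=[2.0, 0.0]))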
Scaling Laws for Discriminative Speech Recognition Rescoring Models
Recent studies have found that model performance has a smooth power-law
relationship, or scaling laws, with training data and model size, for a wide
range of problems. These scaling laws allow one to choose nearly optimal data
and model sizes. We study whether this scaling property is also applicable to
second-pass rescoring, which is an important component of speech recognition
systems. We focus on RescoreBERT as the rescoring model, which uses a
pre-trained Transformer-based architecture fine-tuned with an ASR
discriminative loss. Using such a rescoring model, we show that the word error
rate (WER) follows a scaling law for over two orders of magnitude as training
data and model size increase. In addition, we find that a pre-trained model
requires less data than a randomly initialized model of the same size,
representing effective data transferred from the pre-training step. This
effective transferred data is also found to follow a scaling law with data and
model size.
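A minimal sketch of how such a scaling law could be checked empirically, using made-up numbers rather than the paper's measurements, is a log-log linear fit of WER against training-set size:

import numpy as np

# Hypothetical measurements: training-set size (tokens) vs. dev-set WER (%).
data_sizes = np.array([1e6, 1e7, 1e8, 1e9])
wers = np.array([12.0, 9.5, 7.6, 6.1])

# Power law WER ~ a * N^(-b): take logs, so log(WER) = log(a) - b * log(N),
# and fit by ordinary least squares.
slope, log_a = np.polyfit(np.log(data_sizes), np.log(wers), deg=1)
a, b = np.exp(log_a), -slope
print(f"WER ~ {a:.2f} * N^(-{b:.3f})")

# Extrapolate to a larger training set (only meaningful inside the regime
# where the power law actually holds).
print("predicted WER at 1e10 tokens:", a * (1e10 ** -b))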
Discriminative Speech Recognition Rescoring with Pre-trained Language Models
Second pass rescoring is a critical component of competitive automatic speech
recognition (ASR) systems. Large language models have demonstrated their
ability to use pre-trained information for better rescoring of ASR
hypotheses. Discriminative training, which directly optimizes the minimum
word-error-rate (MWER) criterion, typically improves rescoring. In this study,
we propose and explore several discriminative fine-tuning schemes for
pre-trained LMs. We propose two architectures based on different pooling
strategies of output embeddings and compare them with probability-based MWER. We
conduct detailed comparisons between pre-trained causal and bidirectional LMs
in discriminative settings. Experiments on LibriSpeech demonstrate that all
MWER training schemes are beneficial, giving additional gains of up to 8.5% WER.
The proposed pooling variants achieve lower latency while retaining most of the
improvements. Finally, our study concludes that bidirectionality is better
utilized with discriminative training.
Comment: ASRU 202
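For reference, the MWER criterion mentioned above is the expected number of word errors over the n-best list under the model's hypothesis posterior. The sketch below is a plain-Python illustration of that objective, not the paper's training code; in practice it would be written with a tensor library so the rescoring model can be trained by backpropagation.

import math

def edit_distance(hyp, ref):
    """Word-level Levenshtein distance between a hypothesis and a reference."""
    h, r = hyp.split(), ref.split()
    d = list(range(len(r) + 1))
    for i, hw in enumerate(h, 1):
        prev, d[0] = d[0], i
        for j, rw in enumerate(r, 1):
            prev, d[j] = d[j], min(d[j] + 1, d[j - 1] + 1, prev + (hw != rw))
    return d[len(r)]

def mwer_loss(hypotheses, scores, reference):
    """Expected word errors under the softmax distribution over hypotheses."""
    m = max(scores)
    exps = [math.exp(s - m) for s in scores]
    z = sum(exps)
    return sum(e / z * edit_distance(h, reference)
               for h, e in zip(hypotheses, exps))

# Toy usage: the loss decreases as probability mass moves to the better hypothesis.
print(mwer_loss(["turn on light", "turn on the light"], [0.2, 0.5],
                reference="turn on the light"))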
Prompt Tuning GPT-2 language model for parameter-efficient domain adaptation of ASR systems
Automatic Speech Recognition (ASR) systems have found their use in numerous
industrial applications in very diverse domains, creating a need to adapt to new
domains with small memory and deployment overhead. In this work, we introduce
domain-prompts, a methodology that involves training a small number of domain
embedding parameters to prime a Transformer-based Language Model (LM) to a
particular domain. Using this domain-adapted LM for rescoring ASR hypotheses
can achieve 7-13% WER reduction for a new domain with just 1000 unlabeled
textual domain-specific sentences. This improvement is comparable to, or even
better than, that of fully fine-tuned models, even though just 0.02% of the
parameters of the base LM are updated. Additionally, our method is
deployment-friendly, as the learnt domain embeddings are prefixed to the
model's input rather than changing the base model architecture. Therefore, our
method is an ideal choice for on-the-fly adaptation of LMs used in ASR systems,
allowing them to be progressively scaled to new domains.
Comment: Accepted at InterSpeech 202
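A minimal sketch of the prompt-tuning mechanism, under the assumption of a frozen Transformer LM that accepts pre-computed input embeddings; names such as base_lm are placeholders, not the paper's code.

import torch
import torch.nn as nn

class DomainPrompt(nn.Module):
    def __init__(self, base_lm, embed_dim, prompt_len=20):
        super().__init__()
        self.base_lm = base_lm                        # frozen pre-trained LM
        for p in self.base_lm.parameters():
            p.requires_grad = False
        # The only trainable parameters: roughly prompt_len * embed_dim values.
        self.prompt = nn.Parameter(torch.randn(prompt_len, embed_dim) * 0.02)

    def forward(self, token_embeds):
        """token_embeds: (batch, seq_len, embed_dim) embeddings of the text."""
        batch = token_embeds.size(0)
        prefix = self.prompt.unsqueeze(0).expand(batch, -1, -1)
        # Prepend the learned domain embeddings and run the unchanged base LM.
        return self.base_lm(torch.cat([prefix, token_embeds], dim=1))

Because only the prompt matrix is trained and the base model is untouched, deployment amounts to shipping a small per-domain embedding table, which matches the memory and overhead argument made in the abstract.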
Personalization for BERT-based Discriminative Speech Recognition Rescoring
Recognition of personalized content remains a challenge in end-to-end speech
recognition. We explore three novel approaches that use personalized content in
a neural rescoring step to improve recognition: gazetteers, prompting, and a
cross-attention based encoder-decoder model. We use internal de-identified
en-US data from interactions with a virtual voice assistant supplemented with
personalized named entities to compare these approaches. On a test set with
personalized named entities, we show that each of these approaches improves
word error rate by over 10% relative to a neural rescoring baseline. We also show
that on this test set, natural language prompts can improve word error rate by
7% without any training and with a marginal loss in generalization. Overall,
gazetteers were found to perform best, with a 10% improvement in word error
rate (WER), while also improving WER on a general test set by 1%.
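As a rough illustration of the prompting approach, the sketch below prepends the user's personalized entities as a natural-language prefix before scoring each hypothesis; the template and rescorer_score function are hypothetical, not the paper's exact setup.

def prompted_score(hypothesis, personal_entities, rescorer_score):
    """Prepend a natural-language prompt of personal entities, then score."""
    prompt = "relevant names: " + ", ".join(personal_entities) + ". "
    return rescorer_score(prompt + hypothesis)

# Toy usage with a dummy scorer standing in for a BERT-based rescorer.
entities = ["Anaya", "Rohit"]
dummy_scorer = lambda text: -0.01 * len(text)
print(prompted_score("call anaya on mobile", entities, dummy_scorer))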